Sampling & Probability

Sampling

Research Question

Every research project aims to answer a research question (or multiple questions).

Do ECU students who exercise regularly have a higher GPA?

Population

Each research question aims to examine a population.

Population for this research question is ECU students.

Survey Questions

  • Do you exercise at least once a week?
  • What is your GPA?

Sampling

  • It is usually impossible (or impractical) to study the whole population related to a research question.

  • A sample \(n\) is a subset of the population \(N\).

  • The Goal: Select a representative sample to generalize to the broader population.

What is representative?

Convenience Sampling

  • Sample is easy to access.


  • Example:
    • Stand in front of Joyner Library.
    • Give the survey to 100 ECU students.


  • Issue:
    • Will introduce sampling bias

PSA x2

Data quality matters more than data quantity


Many anthropological (and similar) studies are convenience-based.

Simple Random Sample

Every member of a population has an equal chance of being selected.

  • Examples:
    • Reach out to the registrar for student emails
    • Randomly select 100 students
    • Email students the survey

Simple Random Sampling in R

sample(1:100, 3, replace = FALSE)
[1] 18 63 58

To Generalize:

sample(x = 1:N, size = n, replace = FALSE)

Systematic Sampling

Similar to a simple random sample, BUT members are selected at regular intervals.

# 1. Create a population (e.g., a vector of 1 to 1000)
population <- 1:1000

# 2. Define the desired sample size
sample_size <- 100

# 3. Calculate the sampling interval (k)
N <- length(population) # Population size
k <- N / sample_size
# If k is not an integer, you might use ceiling(N / sample_size) and adjust the logic

# 4. Choose a random starting point (r) between 1 and k
set.seed(123) # Optional: for reproducible results
start_point <- sample(1:k, 1)

# 5. Select every k-th element starting from the random start point
systematic_sample_indices <- seq(from = start_point, to = N, by = k)
systematic_sample <- population[systematic_sample_indices]

# 6. View the first few elements and the dimension of the sample
head(systematic_sample)
[1]  3 13 23 33 43 53
length(systematic_sample) # Should be the desired sample size (100)
[1] 100

Subgroups

Stratified Random Sampling in R

library(dplyr)

# Sample data
set.seed(123) # For reproducibility
data <- data.frame(
  ID = 1:100,
  Gender = sample(c("Male", "Female"), 100, replace = TRUE),
  Income = rnorm(100, mean = 50000, sd = 10000)
)
# Stratified sampling with sample_n()
sampled_data_n <- data %>%
  group_by(Gender) %>%
  sample_n(10)

# Verify the strata sizes
sampled_data_n %>% count(Gender)

Single Stage Cluster Sampling in R

set.seed(123)

population <- data.frame(
  Supermarket = paste("Supermarket", 1:1000, sep = "_"),
  CustomerSatisfaction = rnorm(1000, mean = 75, sd = 10)
)

selected_supermarkets <- sample(population$Supermarket, size = 10, replace = FALSE)

sampled_data <- population[population$Supermarket %in% selected_supermarkets, ]

head(sampled_data)
        Supermarket CustomerSatisfaction
203 Supermarket_203             72.34855
225 Supermarket_225             71.36343
255 Supermarket_255             90.98509
354 Supermarket_354             76.16637
457 Supermarket_457             86.10277
554 Supermarket_554             77.49825

2-Stage Cluster Sampling

set.seed(123)

region <- data.frame(
  Neighborhood = paste("Neighborhood", 1:500, sep = "_"),
  AverageIncome = rnorm(500, mean = 50000, sd = 10000)
)

households <- data.frame(
  Neighborhood = rep(sample(region$Neighborhood, size = 500, replace = TRUE), each = 20),
  HouseholdID = rep(1:20, times = 500),
  EmploymentStatus = sample(c("Employed", "Unemployed"), size = 10000, replace = TRUE)
)

selected_neighborhoods <- sample(region$Neighborhood, size = 5, replace = FALSE)

sampled_households <- households[households$Neighborhood %in% selected_neighborhoods, ]

head(sampled_households)
         Neighborhood HouseholdID EmploymentStatus
1981 Neighborhood_302           1       Unemployed
1982 Neighborhood_302           2         Employed
1983 Neighborhood_302           3         Employed
1984 Neighborhood_302           4         Employed
1985 Neighborhood_302           5       Unemployed
1986 Neighborhood_302           6       Unemployed

Multi-Stage Cluster Sampling

set.seed(123)
states <- data.frame(
  State = paste("State", 1:50, sep = "_"),
  Population = sample(1000000:5000000, 50, replace = TRUE)
)
counties <- data.frame(
  State = rep(sample(states$State, size = 50, replace = TRUE), each = 20),
  County = rep(paste("County", 1:20, sep = "_"), times = 50),
  VaccinationRate = rnorm(1000, mean = 70, sd = 5)
)
selected_states <- sample(states$State, size = 3, replace = FALSE)
selected_counties <- sample(unique(counties$County[counties$State %in% selected_states]), size = 5, replace = FALSE)
# County names repeat across states, so restrict to the selected states as well;
# filtering on County alone would pull in same-named counties from unselected states
sampled_vaccination_centers <- counties[counties$State %in% selected_states &
                                          counties$County %in% selected_counties, ]
head(sampled_vaccination_centers)

# 1. Define Population Distribution (e.g., skewed population)
set.seed(123)
population <- rgamma(100000, shape = 2, scale = 2)

# 2. Take a Sample Distribution (e.g., 100 individuals)
sample_data <- sample(population, 100)
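As a quick follow-up sketch (the population parameters are the ones used above), we can compare the skewed population's mean to the estimate from the sample of 100:

```r
set.seed(123)

# Recreate the skewed population and the sample of 100 drawn from it
population <- rgamma(100000, shape = 2, scale = 2)
sample_data <- sample(population, 100)

mean(population)   # population mean, close to shape * scale = 4
mean(sample_data)  # the sample's estimate of that mean

# Visualize the sample
hist(sample_data, main = "Sample of 100 from a skewed population")
```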

From Sampling to Probability

How do we infer future events or population characteristics?

Random Process

In a random process there is more than one possible outcome.

  • Deterministic prediction of the outcome is difficult or impossible.
sample(x = c("H", "T"), size = 10, replace = T)
 [1] "T" "T" "T" "T" "H" "T" "H" "T" "H" "H"
sample(x = c(1:6), size = 10, replace = T)
 [1] 3 6 5 6 3 5 4 1 5 2

Random vs. Deterministic Processes

Sample Space

The set of all possible outcomes of a random process.

Event

An event is a subset of the sample space.

  • Examples with a 6-sided die:

    • Let A represent the event that a single die roll results in an even number.
      • A = {2, 4, 6}
    • Let B represent the event that a single die roll results in an odd number.
      • B = {1, 3, 5}
    • Let C represent the event that a single die roll results in a prime number.
      • C = {2, 3, 5}

Complement of an Event

The set of all outcomes in the sample space that are not in the event itself.

  • Example:

    • Let C represent the event that a single die roll results in a prime number.
      • C = {2, 3, 5}
    • Notation: \(C^C\)
      • C complement
    • \(C^C\) = {1, 4, 6}

Mutually Exclusive or Disjoint Events

    • Let A represent the event that a single die roll results in an even number.
      • A = {2, 4, 6}
    • Let B represent the event that a single die roll results in an odd number.
      • B = {1, 3, 5}
    • Let C represent the event that a single die roll results in a prime number.
      • C = {2, 3, 5}

Events \(A\) and \(B\) are mutually exclusive because an outcome cannot be both even and odd.

Events \(A\) and \(C\) are not mutually exclusive because the outcome 2 is both even and prime.

Events & Set Notation



Description     Notation        Reading    Elements
Union           \(A \cup C\)    A or C     {2, 3, 4, 5, 6}
Intersection    \(A \cap C\)    A and C    {2}
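The set relations above can be checked directly in base R, representing each event as a vector of outcomes:

```r
S <- 1:6                 # sample space for one die roll
A <- c(2, 4, 6)          # even
B <- c(1, 3, 5)          # odd
C <- c(2, 3, 5)          # prime

union(A, C)              # A or C
intersect(A, C)          # A and C
intersect(A, B)          # empty: A and B are mutually exclusive
setdiff(S, C)            # complement of C
```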

library(ggplot2)

set.seed(1)
OBV <- 1:10

# For each sample size, draw 100 samples and record each sample mean
sizes <- c(1, 9, 16, 25, 36)
Dist.df <- data.frame(
  Size = factor(rep(sizes, each = 100)),
  Sample_Means = unlist(lapply(sizes, function(s)
    replicate(100, mean(sample(OBV, s, replace = TRUE)))
  ))
)

ggplot(Dist.df, aes(Sample_Means, fill = Size)) +
  geom_histogram() +
  facet_grid(. ~ Size)

What is probability?

The likelihood of some event occurring…

How do you view your problem?

  • Option 1: Your answer is based on the frequency of events



  • Option 2: Your answer is based upon your degree of belief in your data AND the system at hand.

Frequentist Probability

The probability of an outcome is defined as the proportion of times the outcome is observed under a large number of repetitions of the random process.

Assume that we are repeating the random process of a coin flip and are recording \(X\), the number of heads in \(n\) coin flips. Then:

\[ P(H) = \lim_{n\to\infty}\frac{X}{n} \]

\[ P(H) =\frac{1}{2} \]

set.seed(42)  # for reproducibility

# Number of coin flips
n_flips <- 10000

# Simulate flipping a fair coin
flips <- sample(c("H", "T"), size = n_flips, replace = TRUE)

# Compute the running proportion of heads
cumulative_mean <- cumsum(flips == "H") / (1:n_flips)

# Plot convergence
plot(1:n_flips, cumulative_mean, type = "l", col = "blue", lwd = 2,
     xlab = "Number of Flips", ylab = "Proportion of Heads",
     main = "Law of Large Numbers: Convergence to 1/2")
abline(h = 1/2, col = "red", lty = 2, lwd = 2)  # Reference line at 1/2

Bayesian Probability

The probability of an outcome is a degree of belief or reasonable expectation quantifying one’s state of knowledge based on observed events and prior knowledge.

set.seed(42)  # For reproducibility

# Number of coin flips
n_flips <- 10000

# Simulate flipping a fair coin
flips <- sample(c("H", "T"), size = n_flips, replace = TRUE)

# Prior: Beta(1,2) (weak prior belief about p = P(heads))
alpha <- 1  # prior successes (heads)
beta <- 2   # prior failures (tails)

# Store posterior mean estimates
posterior_means <- numeric(n_flips)

# Bayesian updating
for (i in 1:n_flips) {
  alpha <- alpha + (flips[i] == "H")  # Increase count if the flip is heads
  beta <- beta + (flips[i] != "H")    # Increase count otherwise
  posterior_means[i] <- alpha / (alpha + beta)  # Compute posterior mean
}

# Plot posterior mean convergence
plot(1:n_flips, posterior_means, type = "l", col = "blue", lwd = 2,
     xlab = "Number of Flips", ylab = "Posterior Mean of P(Heads)",
     main = "Bayesian Law of Large Numbers")
abline(h = 1/2, col = "red", lty = 2, lwd = 2)  # True probability reference line

Axioms of Probability

  1. The probability of an event is between 0 and 1 (non-negativity).

\[ 0 \leq P(A) \leq 1 \]

  2. The probabilities of all outcomes in the sample space must add up to 1 (normalization / unitarity).

\[ P(S) = 1 \]

Axioms of Probability

  3. The probability of a union of mutually exclusive events is the sum of their probabilities (additivity).

\[ P\left(\bigcup\limits_{i=1}^{\infty} A_{i}\right) = \sum_{i=1}^\infty P(A_i) \]

Probability of Complementary Events

\[ P(A) + P(A^C) = 1 \]

If we know that the probability that somebody owns a bike is 0.08, then we know that the probability that somebody does not own a bike is 0.92.

Independence and Multiplication Rule

Two processes are independent if knowing about the outcome of one does not help predict the outcome of the other.

Example:
2 flips of a coin

If events \(A\) and \(B\) are from two independent processes:

\[ P(A \cap B) = P(A) \times P(B) \]

The probability of getting 2 heads in 2 flips:

\[ \frac{1}{2} \times \frac{1}{2} = \frac{1}{4} \]
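A quick simulation sketch (the trial count is an arbitrary choice) agrees with the multiplication rule:

```r
set.seed(123)
n_trials <- 10000

# Flip two fair coins n_trials times; record whether both came up heads
two_heads <- replicate(n_trials,
                       all(sample(c("H", "T"), size = 2, replace = TRUE) == "H"))

mean(two_heads)  # close to 1/4
```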

Addition Rule

If mutually exclusive \[ P(B\cup C) = P(B) + P(C) \]

If not mutually exclusive \[ P(B\cup C) = P(B) + P(C) - P(B \cap C) \]

Example: What is the probability of a heart or a king?

\[ P(H\cup K) = \frac{13}{52} + \frac{4}{52} - \frac{1}{52} = \frac{16}{52} \]
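The same count can be done exhaustively in R by enumerating the deck (a sketch; the rank and suit labels are illustrative choices):

```r
# Build a 52-card deck as all rank-suit combinations
suits <- c("Hearts", "Diamonds", "Clubs", "Spades")
ranks <- c("A", 2:10, "J", "Q", "K")
deck  <- expand.grid(Rank = ranks, Suit = suits)

is_heart <- deck$Suit == "Hearts"
is_king  <- deck$Rank == "K"

sum(is_heart | is_king)   # 16 favorable cards (13 hearts + 4 kings - 1 king of hearts)
mean(is_heart | is_king)  # 16/52
```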

Example: Data - GSS 2018

The General Social Survey (GSS) is a sociological survey that has been regularly conducted since 1972. It is a comprehensive survey that provides information on experiences of residents of the United States.

                                Belief in Life After Death
                                Yes      No     Total
College Science Course   Yes    375      75      450
                         No     485     115      600
                         Total  860     190     1050

Events

Let \(B\) represent an event that a randomly selected person in this sample believes in life after death.


Let \(C\) represent an event that a randomly selected person in this sample took a college level science course.

Random Variable

A numeric outcome of a random process is called a random variable.

Joint Probability Distribution

Describes the probability of 2 or more random variables occurring.

Note that events \(B\) and \(C\) are not mutually exclusive. A randomly selected person can believe in life after death and might have taken a college science course. \(B \cap C \neq \emptyset\)

\[ P(B \cap C) = \frac{375}{1050} \]

Note that \(P(B\cap C) = P(C\cap B)\). Order does not matter.

Marginal Probability

\(P(B)\) represents a marginal probability. So too does \(P(C)\), \(P(B^C)\), and \(P(C^C)\). To calculate these probabilities, we use only the values in the margins of the contingency table:

\[ P(B) = \frac{860}{1050} \]

\[ P(C) = \frac{450}{1050} \]

Conditional Probability

\(P(B|C)\) represents a conditional probability. So too does \(P(B^C|C)\), \(P(C|B)\), and \(P(C|B^C)\). To calculate these probabilities, we focus on the row or column of the given information, reducing the sample space to that given information.


Probability that a randomly selected person believes in life after death given that they have taken a college science course:

\[ P(B|C) = \frac{375}{450} \]
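The joint, marginal, and conditional probabilities above can be reproduced in R by entering the table's counts by hand:

```r
# Counts from the GSS table: rows = college science course, columns = belief
tab <- matrix(c(375,  75,
                485, 115),
              nrow = 2, byrow = TRUE,
              dimnames = list(Course = c("Yes", "No"),
                              Belief = c("Yes", "No")))
n <- sum(tab)  # 1050

tab["Yes", "Yes"] / n                  # joint:       P(B and C) = 375/1050
sum(tab[, "Yes"]) / n                  # marginal:    P(B)       = 860/1050
tab["Yes", "Yes"] / sum(tab["Yes", ])  # conditional: P(B | C)   = 375/450
```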

Conditioning

  • One is isolating the effects of a specific variable.

  • Order Matters!

    • \(P(\text{likes dogs} \mid \text{has a dog}) \neq P(\text{has a dog} \mid \text{likes dogs})\)

Connecting data to probability

Likelihood Function

\(P(Y \mid \theta)\) or \(\mathcal{L}(Y \mid \theta)\)

  • \(Y\) are observed data.
  • \(\theta\) are parameters of the data (e.g., mean, standard deviation, hypothesis, etc.)


The probability of the observed data, given the hypothesis / parameters.

Does NOT sum to 1
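As a sketch (the data values are made up for illustration): the binomial likelihood of observing 7 heads in 10 flips, evaluated over a grid of candidate values of \(\theta\), is not a probability distribution over \(\theta\):

```r
theta <- seq(0, 1, by = 0.01)                # grid of candidate parameter values
lik   <- dbinom(7, size = 10, prob = theta)  # likelihood of the observed data

plot(theta, lik, type = "l",
     xlab = expression(theta), ylab = "Likelihood",
     main = "Binomial likelihood for 7 heads in 10 flips")

sum(lik)  # does not sum to 1 over theta
```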

SCIENCE!

Most scientific questions stem from the likelihood function.

  • AKA, how does the data match the hypothesis?

Distinction: Downstream interpretations relate to how one constructs (and understands) a likelihood.

  • Option A: \(P(Data | \mathbf{Hypothesis})\) - Hypothesis fixed, data varies - Frequentist

  • Option B: \(P(\mathbf{Data} | Hypothesis)\) - Data fixed, hypothesis varies - Bayesian